Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][RFC] Add sparse host buffer source #16252

Draft
wants to merge 2 commits into
base: branch-24.08
Choose a base branch
from

Conversation

rjzamora
Copy link
Member

Related to #15919

DISCLAIMER: This is meant to be a rough RFC to illustrate my idea for a temporary mitigation strategy for NativeFile removal. I am probably not the right person to take the libcudf changes over the line but I still threw some C++ code together for fun :)

Idea: We don't need NativeFile support to get reasonable partial-IO performance from remote storage if we only transfer the necessary byte ranges into host memory with fsspec and libcudf is able to read from the sparse <offset, byte-range> mapping. The fsspec component is covered in #16166, but that PR currently wastes a lot of host-memory by copying the necessary byte ranges into a larger "proxy" byte range that matches the size of the actual file.

@vuule @vyasr - Does a "sparse" host buffer source seem like a reasonable approach for the near term?

@rjzamora rjzamora added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function labels Jul 11, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 11, 2024
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jul 26, 2024

Thank you @rjzamora for suggesting this idea. I think a sparse host buffer data source could make sense in libcudf.

The sparse host buffer data source is meant to be an improvement over the current approach which prepares a host buffer data source for reading with the bytes partially populated and the rest uninitialized. The current approach carries a risk for data corruption if libcudf ever reads an uninitialized byte range, whether from a bug in read_parquet or an over-read in JSON/CSV byte range support. If we had a sparse host buffer data source, it could throw if data were accessed outside of valid byte range. There could also be a performance benefit and host memory pressure reduction in using a sparse host buffer data source because we could allocate smaller host buffers with less wasted space.

  • Would you please help us measure the benefits of a sparse host buffer data source? For this test we could compare the read throughput of a 500 MB file versus a 5 GB file with only 500 MB populated.
  • We might want to separate a "single range" sparse host buffer from a "multi range" host buffer.
    • A sparse host buffer representing a parquet file will have a footer byte range, and then byte ranges for each column chunk (+metadata). This sparse host buffer would have multiple ranges, and we would expect read_parquet to never read outside a populated byte range.
    • A sparse host buffer representing a CSV/JSON file (I believe) must be constrained to only use a single byte range, and the range should be sized to allow over-reading to find the end of the last record. If we had multiple byte range in a single sparse host buffer, I'm not sure how that would interact with byte-range reads over the sparse host buffer... (but it would probably be a nightmare). It's likely that sparse host buffer data sources for text files will have a single range anyways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: Needs owner
Development

Successfully merging this pull request may close these issues.

2 participants